Skip to content

test: use default agent preset for integration tests#1996

Open
xingyaoww wants to merge 10 commits intomainfrom
openhands/use-default-preset-for-integration-tests
Open

test: use default agent preset for integration tests#1996
xingyaoww wants to merge 10 commits intomainfrom
openhands/use-default-preset-for-integration-tests

Conversation

@xingyaoww
Copy link
Copy Markdown
Collaborator

@xingyaoww xingyaoww commented Feb 11, 2026

Summary

Update integration tests to use the default agent preset from openhands.tools.preset.default, ensuring tests validate the same agent configuration shipped to production (GUI/CLI).

Changes:

  • tests/integration/base.py: Updated BaseIntegrationTest to:

    • Import and use get_default_tools() from the preset
    • Add enable_browser property (default False) for tests to override
    • Make tools property non-abstract with default implementation using preset
  • Integration tests (t01-t08): Removed manual tool registration, now using default preset:

    • t01_fix_simple_typo.py
    • t02_add_bash_hello.py
    • t03_jupyter_write_file.py
    • t04_git_staging.py
    • t05_simple_browsing.py (with enable_browser=True)
    • t06_github_pr_browsing.py (with enable_browser=True)
    • t07_interactive_commands.py
    • t08_image_file_viewing.py
  • Behavior tests: Updated default_behavior_tools() in behavior_helpers.py to use preset

  • b05_do_not_create_redundant_files.py: Removed manual tool registration

Net change: -85 lines of code

This ensures integration tests validate the production agent configuration, catching any issues before they reach users.

Fixes #372

Checklist

  • If the PR is changing/adding functionality, are there tests to reflect this?
  • If there is an example, have you run the example to make sure that it works?
  • If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
  • If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
  • Is the github CI passing?

@xingyaoww can click here to continue refining the PR


Agent Server images for this PR

GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant Architectures Base Image Docs / Tags
java amd64, arm64 eclipse-temurin:17-jdk Link
python amd64, arm64 nikolaik/python-nodejs:python3.13-nodejs22-slim Link
golang amd64, arm64 golang:1.21-bookworm Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:1f4e52c-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-1f4e52c-python \
  ghcr.io/openhands/agent-server:1f4e52c-python

All tags pushed for this build

ghcr.io/openhands/agent-server:1f4e52c-golang-amd64
ghcr.io/openhands/agent-server:1f4e52c-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:1f4e52c-golang-arm64
ghcr.io/openhands/agent-server:1f4e52c-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:1f4e52c-java-amd64
ghcr.io/openhands/agent-server:1f4e52c-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:1f4e52c-java-arm64
ghcr.io/openhands/agent-server:1f4e52c-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:1f4e52c-python-amd64
ghcr.io/openhands/agent-server:1f4e52c-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:1f4e52c-python-arm64
ghcr.io/openhands/agent-server:1f4e52c-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:1f4e52c-golang
ghcr.io/openhands/agent-server:1f4e52c-java
ghcr.io/openhands/agent-server:1f4e52c-python

About Multi-Architecture Support

  • Each variant tag (e.g., 1f4e52c-python) is a multi-arch manifest supporting both amd64 and arm64
  • Docker automatically pulls the correct architecture for your platform
  • Individual architecture tags (e.g., 1f4e52c-python-amd64) are also available if needed

Update integration tests to use the default agent preset from
openhands.tools.preset.default, ensuring tests validate the same
agent configuration shipped to production (GUI/CLI).

Changes:
- Update BaseIntegrationTest to use get_default_tools() by default
- Add enable_browser property (default False) for tests to override
- Remove manual tool registration from t01-t08 tests
- Update behavior test helpers to use default preset
- Add enable_browser=True for browsing tests (t05, t06)

This ensures integration tests validate the production agent
configuration, catching any issues before they reach users.

Fixes #372

Co-authored-by: openhands <openhands@all-hands.dev>
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Good Refactoring with Minor Design Smell

Verdict: ✅ Worth merging - Solid refactoring that removes duplication and aligns tests with production config.

Key Insight: The -85 line reduction and single source of truth for tool configuration is excellent. The enable_browser flag works but represents a scaling problem for future tool configurations.

See inline comments for specific feedback.

@openhands-ai
Copy link
Copy Markdown

openhands-ai bot commented Feb 11, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • PR Review by OpenHands

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1996 at branch `openhands/use-default-preset-for-integration-tests`

Feel free to include any additional details that might help me get this PR into a better state.

You can manage your notification settings

Copy link
Copy Markdown
Collaborator

@enyst enyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you'd like to negotiate with the reviewer agent, LGTM 😅

@xingyaoww xingyaoww added the integration-test Runs the integration tests and comments the results label Feb 11, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@xingyaoww xingyaoww added integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results labels Feb 13, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@enyst enyst added behavior-initiative This is related to the system prompt sections and LLM steering. and removed behavior-initiative This is related to the system prompt sections and LLM steering. labels Feb 14, 2026
@xingyaoww
Copy link
Copy Markdown
Collaborator Author

@OpenHands please check the github workflow in this PR https://github.com/OpenHands/software-agent-sdk/actions/runs/21925641027/job/63317348622

Tell me why the integration test in this PR never finish running?

@openhands-ai
Copy link
Copy Markdown

openhands-ai bot commented Feb 15, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

@openhands-ai
Copy link
Copy Markdown

openhands-ai bot commented Feb 15, 2026

Analysis: Why the Integration Test Never Finishes

After analyzing the GitHub workflow logs and the PR changes, I've identified the root cause of why the integration tests are hanging for 6+ hours before being cancelled.

Root Cause

The PR changes t05_simple_browsing.py and t06_github_pr_browsing.py to use browser tools by setting enable_browser=True, which includes the full BrowserToolSet (14 browser-related tools).

The problem is:

  1. Original design: These tests were designed to work with terminal tools only (curl/wget). The test instruction "Browse localhost:8000, and tell me the ultimate answer to life" was meant to be completed using terminal commands.

  2. PR change: By setting enable_browser=True, the tests now include browser tools. The agent attempts to use actual browser automation (Chromium via Playwright) to complete the task.

  3. Why it hangs:

    • Each browser action has a 300-second (5-minute) timeout in BrowserToolExecutor.__call__
    • The test has max_iteration_per_run=100 iterations
    • Worst case: 100 iterations × 300 seconds = 8.3 hours per test
    • The job was cancelled after ~6 hours, which aligns with this calculation
  4. Additional factors:

    • Browser operations are inherently slower than terminal commands
    • The agent may struggle to complete the task efficiently with browser tools
    • The test wasn't designed for browser-based interaction (the HTML page requires clicking a button to reveal the answer)

Evidence from Logs

  • Job started: 2026-02-11T22:31:34Z
  • Job cancelled: 2026-02-12T04:36:23Z (~6 hours later)
  • Step "Run integration test evaluation" was stuck in in_progress state
  • No output logs were captured for the test execution step

Recommended Fix

The tests t05_simple_browsing.py and t06_github_pr_browsing.py should NOT set enable_browser=True. They should use the default enable_browser=False and rely on terminal tools as originally designed.

Alternatively, if browser testing is desired:

  1. Create separate browser-specific tests with appropriate timeouts
  2. Reduce max_iteration_per_run for browser tests
  3. Add per-test timeouts to prevent runaway execution

View full conversation

@xingyaoww xingyaoww removed the integration-test Runs the integration tests and comments the results label Feb 15, 2026
@xingyaoww xingyaoww added the integration-test Runs the integration tests and comments the results label Feb 15, 2026 — with OpenHands AI
@github-actions
Copy link
Copy Markdown
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

Co-authored-by: openhands <openhands@all-hands.dev>
@xingyaoww xingyaoww removed the integration-test Runs the integration tests and comments the results label Feb 15, 2026
@xingyaoww xingyaoww added the integration-test Runs the integration tests and comments the results label Feb 15, 2026 — with OpenHands AI
@github-actions
Copy link
Copy Markdown
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@OpenHands OpenHands deleted a comment from all-hands-bot Feb 27, 2026
@OpenHands OpenHands deleted a comment from all-hands-bot Feb 27, 2026
@enyst enyst added integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results labels Feb 27, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@xingyaoww xingyaoww added the integration-test Runs the integration tests and comments the results label Mar 6, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 6, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@xingyaoww
Copy link
Copy Markdown
Collaborator Author

@OpenHands review https://github.com/OpenHands/software-agent-sdk/actions/runs/22763604507/job/66025865386?pr=1996 - why it gets stuck there? I think we need to setup chromium for the integration test too, similar to the test-examples tests

@openhands-ai
Copy link
Copy Markdown

openhands-ai bot commented Mar 6, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

Browser-enabled integration tests (t05, t06) require Chromium to be
available. Without a system-installed Chromium, the browser tool's
auto-install via uvx playwright hangs or installs without system deps,
causing the entire test run to get stuck indefinitely.

This mirrors the Chromium setup already present in run-examples.yml.

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link
Copy Markdown

openhands-ai bot commented Mar 6, 2026

Summary

Problem: The integration tests in PR #1996 were getting stuck indefinitely because the integration-runner.yml workflow lacked Chromium installation. Tests t05_simple_browsing.py and t06_github_pr_browsing.py enable browser tools (enable_browser=True), which require Chromium. Without it, the browser tool's auto-install fallback (uvx playwright install chromium --with-deps) either hangs or installs without proper system dependencies, causing the entire test run to stall.

Fix: Added an "Install Chromium" step to the run-integration-tests job in .github/workflows/integration-runner.yml, mirroring the identical setup already present in run-examples.yml:

- name: Install Chromium
  run: |
      sudo apt-get update
      sudo apt-get install -y chromium-browser

Status: The change (1 file, +5 lines) has already been committed and pushed to the openhands/use-default-preset-for-integration-tests branch as commit b96e8bf8.

@xingyaoww xingyaoww added integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results labels Mar 6, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 6, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@xingyaoww
Copy link
Copy Markdown
Collaborator Author

@OpenHands please check the workflow https://github.com/OpenHands/software-agent-sdk/actions/runs/22766812391/job/66036717130?pr=1996 and the PR context, why it is still broken after adding chromium

@openhands-ai
Copy link
Copy Markdown

openhands-ai bot commented Mar 6, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

After browser-enabled tests (t05, t06) complete, the Chrome process and
its async event loop remain alive in ProcessPoolExecutor workers because
LocalConversation.close() is never explicitly called. This prevents the
worker processes from exiting, causing the CI job to hang indefinitely.

Call self.conversation.close() in teardown() to ensure all tool executors
(including BrowserToolExecutor) are properly shut down after each test.

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link
Copy Markdown

openhands-ai bot commented Mar 6, 2026

Summary

Problem

The integration tests in PR #1996 were hanging indefinitely (3+ hours until GitHub Actions timeout) even after adding chromium-browser installation to the CI workflow.

Root Cause

The browser tests (t05, t06) were passing successfully — chromium was working fine. The hang occurred after all tests completed because:

  1. BrowserToolExecutor launches a Chrome process with async event loops and background threads
  2. After tests finish, BaseIntegrationTest.teardown() was empty — it never called LocalConversation.close()
  3. The orphan Chrome process and its background threads kept ProcessPoolExecutor worker processes alive indefinitely

Evidence from the cancelled run's cleanup confirmed orphan chrome and chrome_crashpad_handler processes.

Fix (commit f5a38f12)

Added self.conversation.close() to BaseIntegrationTest.teardown() in tests/integration/base.py. This calls LocalConversation.close() which already properly shuts down all tool executors (including BrowserToolExecutor → kills Chrome, drains async events, closes the async executor).

Checklist

  • ✅ Root cause identified and explained
  • ✅ Minimal one-line fix — only added self.conversation.close() to the existing teardown() method
  • ✅ Pre-commit hooks all pass (ruff format, ruff lint, pycodestyle, pyright, import checks)
  • ✅ Changes committed and pushed to the PR branch

@xingyaoww xingyaoww added integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results labels Mar 6, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 6, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@all-hands-bot
Copy link
Copy Markdown
Collaborator

[Automatic Post]: It has been a while since there was any activity on this PR. @xingyaoww, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

4 similar comments
@all-hands-bot
Copy link
Copy Markdown
Collaborator

[Automatic Post]: It has been a while since there was any activity on this PR. @xingyaoww, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

@all-hands-bot
Copy link
Copy Markdown
Collaborator

[Automatic Post]: It has been a while since there was any activity on this PR. @xingyaoww, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

@all-hands-bot
Copy link
Copy Markdown
Collaborator

[Automatic Post]: It has been a while since there was any activity on this PR. @xingyaoww, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

@all-hands-bot
Copy link
Copy Markdown
Collaborator

[Automatic Post]: It has been a while since there was any activity on this PR. @xingyaoww, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

@xingyaoww xingyaoww added integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results labels Apr 4, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 4, 2026

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

@xingyaoww
Copy link
Copy Markdown
Collaborator Author

@OpenHands pls understand the context of this pr, then understand why https://github.com/OpenHands/software-agent-sdk/actions/runs/23972109855/job/69930673480?pr=1996 failed and fix it

@openhands-ai
Copy link
Copy Markdown

openhands-ai bot commented Apr 5, 2026

I'm on it! xingyaoww can track my progress at all-hands.dev

- Rename t01-t04, t07-t08 test files back from _t*.py to t*.py so the
  test runner discovers them again (they were temporarily disabled for
  debugging but never re-enabled).

- Change ProcessPoolExecutor to use shutdown(wait=False) after all
  results are collected, preventing the runner from hanging indefinitely
  when browser/Chrome child processes survive worker teardown.

- Clear BrowserToolSet._shared_executor class variable on close() so
  stale references don't prevent worker process exit.

Co-authored-by: openhands <openhands@all-hands.dev>
@openhands-ai
Copy link
Copy Markdown

openhands-ai bot commented Apr 5, 2026

Summary of Changes

The PR comment asked me to understand the PR context, diagnose why the CI workflow failed, and fix it.

Was the request fully addressed?

  • Understood the PR context: PR test: use default agent preset for integration tests #1996 updates integration tests to use the default agent preset from openhands.tools.preset.default instead of manual tool registration.
  • Diagnosed the CI failure: All 4 model test jobs timed out at 3 hours because:
    1. Tests t01t04, t07t08 were renamed with a _ prefix (commit c9ef6c99, "temporarily disable non-browser tests") but never re-enabled, leaving only the 2 browser tests (t05, t06) to run.
    2. Both browser tests passed, but ProcessPoolExecutor.shutdown(wait=True) hung indefinitely because orphan Chrome/tmux subprocesses prevented worker process exit.
    3. No test artifacts were uploaded → the consolidate-results job failed with "Results directory not found."
  • Fixed the issue: Pushed commit e7ca120e with three targeted changes.

Are the changes concise?

Yes, all three changes directly address the failure:

  1. Renamed 6 test files back from _t*.pyt*.py so the test runner's [tbc]*.py glob discovers them again.
  2. Changed run_evaluation() in run_infer.py to use executor.shutdown(wait=False, cancel_futures=True) instead of the context manager's implicit shutdown(wait=True), preventing the process hang after all results are collected.
  3. Added cleanup in BrowserToolExecutor.close() to reset BrowserToolSet._shared_executor = None, releasing the stale class-level reference that could prevent worker process exit.

No extraneous changes were made.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 5, 2026

Coverage

Coverage Report •
FileStmtsMissCoverMissing
openhands-tools/openhands/tools/browser_use
   impl.py25915839%57–61, 63–64, 66, 68, 70–73, 75–76, 78, 80, 87, 99–102, 108–109, 114, 116–118, 120–121, 129–131, 133–137, 142, 193, 198–201, 203, 225–227, 230–232, 234, 247, 284–285, 289, 299, 314–315, 320, 334–335, 340–341, 351, 369–370, 372–385, 388–401, 403–404, 410, 415–418, 426, 428, 431–432, 438–439, 444–445, 451–452, 456–457, 461–462, 466, 468–469, 471–474, 477–478, 484, 486, 488, 496–497, 501–502, 507–508, 512–513, 517–518, 523–524, 536–537, 548–549, 553–557, 571–573, 578, 583–584, 594, 600–601
TOTAL220111113949% 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

integration-test Runs the integration tests and comments the results

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Test: use default agent preset for integration tests

4 participants